Part of iPyMacLern project.

Copyright (C) 2015 by Eka A. Kurniawan

eka.a.kurniawan(ta)gmail(tod)com

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Tested On


In [1]:
import sys
print("Python %d.%d.%d" % (sys.version_info.major, \
                           sys.version_info.minor, \
                           sys.version_info.micro))


Python 3.3.5

In [2]:
import numpy as np
print("NumPy %s" % np.__version__)


NumPy 1.8.0

In [3]:
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib %s" % matplotlib.__version__)


matplotlib 1.3.1

Display Settings


In [4]:
# Display graph inline
%matplotlib inline

# Display graphs in 'retina' format for Macs with Retina displays. Otherwise, use the PNG or SVG format.
%config InlineBackend.figure_format = 'retina'
#%config InlineBackend.figure_format = 'PNG'
#%config InlineBackend.figure_format = 'SVG'

# For displaying 3D graph
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

Linear Regression With One Variable[1]


The data we use comes from different outlets across multiple cities and consists of the profit of each outlet together with the population of the city in which the outlet is located. Using this known data, we can perform a linear regression analysis on the two variables (profit vs. population) and then use the result to predict the profit of a new outlet, given the population of the city where the new outlet is going to be located.

Get and Plot Data


In [5]:
filename = 'ex1data1.txt'
data = np.recfromcsv(filename, delimiter=',', names=['population', 'profit'])

In [6]:
plt.scatter(data['population'], data['profit'], color='red', s=2)
plt.title('Restaurant Sales')
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.show()



In [7]:
# Total training samples
m = len(data)

# Add column of ones to variable x
X = np.append(np.ones([m,1]),data['population'].reshape(m,1),1)

# Output
y = data['profit']

# Initialize fitting parameters
theta = np.zeros([2, 1])
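
A quick sanity check on the arrays built above (an illustrative sketch, not part of the original exercise): X should be an m-by-2 design matrix whose first column is all ones, y the m profit values, and theta a 2-by-1 vector of zeros.

# Illustrative shape check (hypothetical cell)
print("X shape:", X.shape)          # (m, 2): column of ones plus population
print("y shape:", y.shape)          # (m,): profit of each training sample
print("theta shape:", theta.shape)  # (2, 1): parameters initialized to zero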

Cost Function

For $x$ as the input variable (feature) and $y$ as the output/target variable, we have the cost function $J$ of the parameter $\theta$ as follows.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$$

Where the hypothesis $h_{\theta}(x)$ is the linear model, defined as follows.

$$h_{\theta}(x) = \theta^Tx = \theta_0 + \theta_1x$$

Here $m$ is the total number of training samples and $i$ is the sample index.


In [8]:
def compute_cost(X, y, theta):
    # Total training samples
    m = len(y)
    
    h = theta[0] + (theta[1] * X[:,1])
    J = (1 / (2*m)) * sum((h - y) ** 2)
    return J

In [9]:
J = compute_cost(X, y, theta)
print(J)


32.0727338775
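
Since the hypothesis can be written as $h_{\theta}(x) = \theta^Tx$, the same cost can also be computed in a fully vectorized way. The helper below is only an illustrative sketch (it assumes X is the m-by-2 design matrix built above and theta a 2-by-1 column vector) and is not used in the rest of the notebook.

def compute_cost_vectorized(X, y, theta):
    # Total training samples
    m = len(y)
    # X.dot(theta) evaluates the hypothesis for every sample at once
    h = X.dot(theta).flatten()
    return (1 / (2 * m)) * np.sum((h - y) ** 2)

# If run at this point (theta still all zeros), this should match the value above
print(compute_cost_vectorized(X, y, theta))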

Gradient Descent Algorithm

Our objective is to find parameters ($\theta_0$ and $\theta_1$) that produce a hypothesis $h_{\theta}(x)$ as close as possible to the target variable $y$. Since we have defined our cost function $J(\theta)$ as the error between the two, all we need to do is minimize the cost function $\left(\underset{\theta_0, \theta_1}{\text{minimize}} \hspace{2mm} J(\theta_0, \theta_1)\right)$.

The following is the algorithm, where $\alpha$ is the learning rate.

repeat until convergence {

$\hspace{10mm} \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1) \hspace{10mm} (\text{for } j=0 \text{ and } j=1)$

}

Differentiating the cost function $J(\theta_0, \theta_1)$ with respect to $\theta_j$ for both $j=0$ and $j=1$ (the factor of $2$ from the chain rule cancels the $\frac{1}{2}$ in $\frac{1}{2m}$):

$j = 0$ : $$\frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1) =$$ $$\frac{\partial}{\partial\theta_0} \frac{1}{2m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 =$$ $$\frac{\partial}{\partial\theta_0} \frac{1}{2m} \sum_{i=1}^{m}\left(\theta_0 + \theta_1x^{(i)}-y^{(i)}\right)^2 =$$ $$\frac{1}{m} \sum_{i=1}^{m}\left(\theta_0 + \theta_1x^{(i)} - y^{(i)}\right) =$$ $$\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)$$

$j = 1$ : $$\frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1) =$$ $$\frac{\partial}{\partial\theta_1} \frac{1}{2m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 =$$ $$\frac{\partial}{\partial\theta_1} \frac{1}{2m} \sum_{i=1}^{m}\left(\theta_0 + \theta_1x^{(i)}-y^{(i)}\right)^2 =$$ $$\frac{1}{m} \sum_{i=1}^{m}\left(\theta_0 + \theta_1x^{(i)} - y^{(i)}\right) \cdot x^{(i)} =$$ $$\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}$$

For this linear case, the algorithm would be as follows.

repeat until convergence {

$\hspace{10mm} \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)$

$\hspace{10mm} \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}$

}


In [10]:
def perform_gradient_descent(X, y, theta, alpha, iterations):
    # Total training samples
    m = len(y)
    # History of costs
    J_history = np.zeros([iterations, 1])
    
    for i in range(iterations):
        h = theta[0] + (theta[1] * X[:,1])
        error = h - y
        # Compute both updates before assigning so that theta[0] and theta[1]
        # are updated simultaneously
        theta_tmp1 = theta[0] - ((alpha / m) * sum(error))
        theta_tmp2 = theta[1] - ((alpha / m) * sum(error * X[:,1]))
        theta[0] = theta_tmp1
        theta[1] = theta_tmp2
        
        J_history[i] = compute_cost(X, y, theta)
        
    return theta, J_history
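
The same update rule can also be written in vectorized form. The helper below is only an illustrative sketch (it assumes X is the m-by-2 design matrix and theta a 2-by-1 column vector) and is not used in the cells that follow; the loop-based implementation above is what the rest of the notebook relies on.

def perform_gradient_descent_vectorized(X, y, theta, alpha, iterations):
    # Total training samples
    m = len(y)
    # Work with y as a column vector so the shapes line up with theta
    y_col = y.reshape(m, 1)
    # History of costs
    J_history = np.zeros([iterations, 1])

    for i in range(iterations):
        # error = h - y for every sample, as an (m, 1) column vector
        error = X.dot(theta) - y_col
        # Simultaneous update: theta_j := theta_j - (alpha / m) * sum(error * x_j)
        theta = theta - (alpha / m) * X.T.dot(error)
        J_history[i] = compute_cost(X, y, theta)

    return theta, J_history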

In [11]:
iterations = 1500
alpha = 0.01

theta, J_history = perform_gradient_descent(X, y, theta, alpha, iterations)

In [12]:
theta


Out[12]:
array([[-3.63029144],
       [ 1.16636235]])

In [13]:
J_history


Out[13]:
array([[ 6.73719046],
       [ 5.93159357],
       [ 5.90115471],
       ..., 
       [ 4.48343473],
       [ 4.48341145],
       [ 4.48338826]])

Plot Cost History


In [14]:
plt.plot(J_history)
plt.title('Cost History')
plt.xlabel('Iterations')
plt.ylabel(r'Cost $J(\theta)$')
plt.show()


Plot Linear Fit


In [15]:
plt.scatter(data['population'], data['profit'], color='red', s=2)
plt.plot(data['population'], np.dot(X, theta), color='blue')
plt.title('Restaurant Sales')
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.show()


Make Prediction

Predict the profit for cities with populations of 35,000 and 70,000 people.


In [16]:
ttl_population = 35000
print('For population = 35,000, we predict a profit of %s' % \
      int(np.dot([1, ttl_population/10000], theta) * 10000))


For population = 35,000, we predict a profit of 4519

In [17]:
ttl_population = 70000
print('For population = 70,000, we predict a profit of %s' % \
      int(np.dot([1, ttl_population/10000], theta) * 10000))


For population = 70,000, we predict a profit of 45342
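
As a quick check, the first prediction can be unrolled by hand using the fitted parameters above. The population is expressed in units of 10,000s, so 35,000 people corresponds to $x = 3.5$:

$$h_{\theta}(3.5) = \theta_0 + \theta_1 \cdot 3.5 \approx -3.6303 + 1.1664 \times 3.5 \approx 0.452$$

which, in units of 10,000s of dollars, is a profit of about 4,520 (the cell above reports 4519 because of the int() truncation). The 70,000 case works the same way with $x = 7$.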

Visualize Cost for Different Range of Parameters


In [18]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(azim=-120)

theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 100)
theta0_mesh, theta1_mesh = np.meshgrid(theta0_vals, theta1_vals)
zs = np.array([compute_cost(X, y, [theta0_val, theta1_val])
               for theta0_val, theta1_val in zip(np.ravel(theta0_mesh),
                                                 np.ravel(theta1_mesh))])
Z = zs.reshape(theta0_mesh.shape)

ax.plot_surface(theta0_mesh, theta1_mesh, Z, cmap=cm.coolwarm)
ax.set_xlabel('Theta 0')
ax.set_ylabel('Theta 1')
ax.set_zlabel(r'Cost $J(\theta)$')

plt.show()
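
A contour view of the same cost surface can make the minimum easier to see. The snippet below is only a sketch reusing the theta0_mesh, theta1_mesh, and Z arrays computed above; the log-spaced contour levels are an assumption chosen purely to make the bowl shape around the fitted parameters visible.

# Contour view of the cost surface (illustrative sketch reusing Z from above)
plt.contour(theta0_mesh, theta1_mesh, Z, np.logspace(-2, 3, 20))
# Mark the parameters found by gradient descent
plt.scatter(theta[0], theta[1], color='red', marker='x')
plt.xlabel('Theta 0')
plt.ylabel('Theta 1')
plt.title(r'Contours of Cost $J(\theta)$')
plt.show()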


References

  1. Andrew Y. Ng, 2015. Machine Learning. Week 1: Linear Regression with One Variable. Stanford. Coursera. https://www.coursera.org/learn/machine-learning